Home Credit Default Risk

Team Members

image.png

1.0 FP Group 11 HCDR

1.1 Phase Leader Plan

image.png

1.2 Credit Assignment Plan

image.png image-2.png

1.3 Abstract

Home Credit offers unsecured lending based on historical credit histories and repayment trends, modelled with machine learning. A credit score is calculated for each user from criteria such as the balance the user has maintained. In this project, we predict customer repayment status, i.e. whether the user will default, using machine learning pipelines and models built on the datasets provided by Kaggle. The data collection includes seven separate tables that aid in determining user status: application data, bureau, bureau balance, credit card balance, installments payments, POS CASH balance, and previous applications. In Phase 2, we present feature engineering, EDA, and modelling pipelines. We experimented with encoding categorical baseline inputs and selecting features for Decision Tree, Random Forest, and Logistic Regression models. The Random Forest baseline pipeline has the highest test accuracy, followed by Logistic Regression, then the Decision Tree, with Lasso and Ridge being the least accurate.

1.4 Data and Task Description

image-7.png

1.5 Gantt Chart

image.png

1.6 Machine Learning Algorithms and Metrics

The goal of this project is to predict whether a customer will repay a loan. This is therefore a binary classification task where the outcome is 0 or 1. To address it, we will build the following machine learning models:

  1. Logistic Regression:
    • In our case, the number of features is relatively small (under 1,000) and the number of examples is large, so logistic regression can be a good fit for this classification task.

  2. Decision Tree:
    • Decision trees handle categorical data well, and our target is categorical in nature, so a decision tree is a good fit.

  3. Random Forest:
    • Random Forest works well with a mixture of numerical and categorical features.
    • Since we have a good mixture of both types of features, a random forest can be a good fit.
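
The three baseline classifiers above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic, class-imbalanced data (`make_classification` stands in for the HCDR tables; all names and hyperparameters here are illustrative assumptions, not our actual pipeline configuration):

```python
# Minimal sketch of the three baseline classifiers on synthetic,
# imbalanced data (~92% negative class, roughly mirroring HCDR).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.92], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Test accuracy for each baseline model
    print(name, round(model.score(X_test, y_test), 3))
```

Keeping the models in a single dict makes it easy to train and score all baselines in one loop and compare them side by side.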

1.6.1 Loss Function
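
For a binary classifier that outputs a probability for the positive class, we use binary cross-entropy (log loss), the metric reported as "loss" in our experiment log. A minimal sketch of the computation, assuming an array of true labels `y` and predicted probabilities `p` (the example values are illustrative):

```python
# Binary cross-entropy (log loss): the average negative log-likelihood
# of the true labels under the predicted probabilities.
import numpy as np

def log_loss(y_true, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(round(log_loss(y, p), 4))  # → 0.1976
```

Lower values are better: confident correct predictions drive the loss toward 0, while confident wrong predictions are penalized heavily.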

1.6.2 Metrics

  1. Confusion Matrix:
    • A confusion matrix, also called an error matrix, is used in machine learning, particularly for classification. It tabulates counts of predicted versus actual labels. "TN" (True Negative) is the number of negative cases correctly classified; "TP" (True Positive) is the number of positive cases correctly classified; "FP" (False Positive) is the number of actual negative cases mistakenly classified as positive; and "FN" (False Negative) is the number of actual positive cases mistakenly classified as negative.

image.png

  2. AUC:
    • AUC stands for "Area Under the ROC Curve." It measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1), and it is a widely used metric for binary classification problems.
  3. Accuracy:
    • Accuracy gauges the model's effectiveness as the ratio of correct predictions (true positives plus true negatives) to all predictions made. It is commonly used to evaluate binary classification models.
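
The metrics above can be computed with scikit-learn as sketched below. The labels and probabilities are illustrative toy values, not model outputs from our pipeline; note that AUC is computed from the predicted probabilities, while the confusion matrix and accuracy use thresholded labels:

```python
# Confusion matrix, accuracy, and AUC on illustrative predictions.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9, 0.4, 0.2, 0.6, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

# sklearn orders the binary confusion matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN FP FN TP:", tn, fp, fn, tp)        # → 5 1 1 3
print("accuracy:", accuracy_score(y_true, y_pred))  # → 0.8
print("AUC:", roc_auc_score(y_true, y_prob))        # → 0.875
```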

1.7 Machine Learning Pipeline Steps

image.png

1.8 Block Diagram

image.png

EXPLORATORY DATA ANALYSIS

GENDER Vs INCOME based on Target

OWN HOUSE COUNT based on Target

OWN CAR COUNT based on Target

BORROWERS OWNING A CAR ARE MORE LIKELY TO REPAY

-------------------------------------------------------------------------------------------------------

OCCUPATION TYPE COUNT based on Target

OCCUPATION TYPE vs INCOME based on Target

Defaulter percentage is lower when IC_ratio is either low or high

-----------------------------------------------------------------------------------------------------------

REPAYERS TO APPLICATION RATIO

CORRELATION OF POSITIVE DAYS SINCE BIRTH AND TARGET

CORRELATION OF POSITIVE DAYS SINCE EMPLOYMENT AND TARGET

FETCHING IMPORTANT RELEVANT FEATURES
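
One common way to fetch the most relevant features, sketched below under the assumption of a fitted tree ensemble, is to rank columns by Random Forest feature importances and keep the top k. The synthetic data and the choice of k = 5 are illustrative, not our actual HCDR feature set:

```python
# Rank features by Random Forest importance and keep the top k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Indices of the 5 most important features, highest importance first
top_k = np.argsort(rf.feature_importances_)[::-1][:5]
X_reduced = X[:, top_k]
print(top_k, X_reduced.shape)
```

With a pandas DataFrame, the same indices can be mapped back to column names to report which features the model relied on most.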

Result and Discussion

The experiment log reports the accuracy, AUC, and loss of the baseline Logistic Regression, Decision Tree, and Random Forest models. For the baseline Logistic Regression model, the train (92.0) and test (91.9) accuracy are on the higher side, which suggests it performs well on the provided dataset, and its log loss is low at 0.25, meaning most predictions are correct. However, its AUC of 0.5 indicates chance-level discrimination between the classes, so the high accuracy of 91.94 should be read with the dataset's class imbalance in mind. Random Forest and Logistic Regression have approximately the same train and test accuracy and log loss, but the baseline Random Forest remains the best-fit algorithm, beating Logistic Regression by a small margin on every criterion: AUC improves by 0.02, test accuracy by 0.3, and overall accuracy by 0.3. The log loss for Random Forest (0.27) is also on the lower side, so it beats the baseline Logistic Regression model. The Decision Tree has comparatively low train and test accuracy; this drop may be due to the shallow depth of the tree relative to the number of variables we used. In return, however, its AUC increases by 0.07 over the baseline Random Forest and Logistic Regression. So the Decision Tree can also be a good fit for this dataset, though it would need some extra fine-tuning.

image.png

Conclusion

The HCDR project's goal is to forecast repayment capacity among a financially underserved population. This project is crucial because both the lender and the borrower want reliable estimates. Home Credit's ML pipelines acquire data from the data sources via APIs in real time, run EDA, and fit models to generate scores, which allows them to present loan offers to their customers with the best amount and APR. Since NPA (non-performing assets) must stay below 5% to maintain a profitable firm, risk analysis becomes extremely important. Credit history is an indicator of a user's trustworthiness, built from parameters such as the average, minimum, and maximum balances the user maintains, reported bureau scores, salary, etc. Repayment patterns can be analysed from the timely defaults and repayments the user has made in the past. Alternative data includes other criteria such as location information, social media data, calling/SMS data, etc. As part of this project, we create machine learning pipelines, perform exploratory data analysis on the datasets provided by Kaggle, and evaluate the models using a variety of evaluation measures before deploying one. Phase 2 involved estimating several models. Data imputation and feature selection were performed: we first filled in missing values for certain features and then, based on our prior understanding, chose the pertinent features to include. We trained and assessed several models, including Random Forest, Decision Tree, and Logistic Regression, to discover the best one. From Phase 2 we concluded that the Decision Tree model is unable to beat the baseline model, and that the Random Forest model performs the best of all. In Phase 3 we plan to improve all models through hyperparameter tuning.